Part 1: Visualizing spatial data on maps

Let’s make data maps! This is a huge topic, once again, so we’ll keep things relatively simple for the purposes of this course. There are plenty of resources online if you’d like to dig further into plotting spatial data. But, now armed with your ggplot powers, you will be able to make some compelling and informative maps.

Many public datasets contain latitude and longitude (or other coordinate) columns, where each observation (row) has an exact geographic location associated with it. It can be straightforward to visualize these data on a map, but first you’ll need a map/geographic data source.

Some of the big map sources are Google Maps, and the Open Street Framework. Often, you’ll need to request an API key to use their maps, which is not difficult but it requires making an account, requesting a key, etc., and there are plenty of guides for how to do this online.

Stadia Maps is an easy-to-access resource for map data that pairs nicely with the ggmap library for R. It does require an API key, which involves setting up a free account. I’ve set up an account for our class already, and have included the API key in the code chunk below. Go ahead and run the code chunk to load up ggmap (you may need to install it first) and ggplot2, and connect to the Stadia Maps API.

library(ggmap)
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.1.3
## i Google's Terms of Service: <https://mapsplatform.google.com>
##   Stadia Maps' Terms of Service: <https://stadiamaps.com/terms-of-service/>
##   OpenStreetMap's Tile Usage Policy: <https://operations.osmfoundation.org/policies/tiles/>
## i Please cite ggmap if you use it! Use `citation("ggmap")` for details.
library(ggplot2) # should load in with ggmap automatically, but no harm in doing it explicitly

key = 'e7f97169-8de3-4e18-9683-5e08ee8affa3'
register_stadiamaps(key)

The key differences between ggmap and ggplot are: we use ggmap() to iniate the plot, and instead of a dataframe as a first argument, we pass in a map. With Stadia Maps, this comes from the get_stadiamap() function. We’ll then link our actual dataframe of interest to the map using aes() in a separate command (in the geom - later).

By default, get_stadiamap() returns a map of Houston, TX with a particular map style. Run ?get_stadiamap at the Console to see a list of different map styles you can choose from (though the documentation appears to be incomplete). Play around with different arguments to maptype = and find one you like - I am a fan of stamen_toner_lite.

ggmap(get_stadiamap(maptype = 'alidade_smooth_dark'))
## i © Stadia Maps © Stamen Design © OpenMapTiles © OpenStreetMap contributors.

Before we move forward, let’s load in a dataset that contains spatial coordinate data. For this Part, you can use the arrests dataset we used last time, OR you can use a dataset of your choice. If you want to use your own data, all I ask is that you make an informative, clean, good-looking map and that you write 2-3 sentences describing what the map shows. You can use the code chunks below as inspiration, but feel free to reorganize the rest of this section as it makes sense to you. Otherwise, follow along.

These map sources (including Stadia Maps) grant you access to a map of the entire planet. So, let’s limit our visualization range to that of the data. get_stadiamap() includes an argument for a “bounding box”. Think of this as a window into the geographic area we want to look at. For example, in the plot above, the bounding box ranges from -95.8 to about -94.9 degrees longitude, and 29.4 to about 30.1 degrees latitude.

# Use the arrests data provided, or your own dataset
arrests = read.csv('arrests.csv')

We could find a bounding box for our region manually, but perhaps a better way is to let the data inform us. Using max() and min(), complete the bounding box code below to determine the range of geographical coordinates we should draw for the NYPD arrests dataset.

# Make sure each line except the last one ends with a comma,
bbox = c(
  bottom = min(arrests$Latitude),
  top = max(arrests$Latitude),
  left = min(arrests$Longitude),
  right = max(arrests$Longitude)
)

# Call 'bbox' by name to check that you stored the coordinates
bbox
##    bottom       top      left     right 
##  40.49940  40.91272 -74.25184 -73.70161

Now you can pass in bbox as the first argument to get_stadiamap() and it will return a map bounded by those coordinates. There is also a zoom = argument that defines the resolution of the map. Run ggmap(get_stadiamap()) giving arguments bbox and a zoom of 11, with a maptype that you like. Assign the output to a variable called map.

While you’re at it, get rid of the ugly background grid by setting the panel.background to a blank element with element_blank() using theme(). The latitude/longitude exact degrees are not adding anything either, so get rid of them, too, in the same call to theme by setting axis.ticks, axis.text, and axis.title to element_blank(). An informative title would be good, though, so add one using labs(title = 'your title'). Note that the plot title is different from the axis titles (‘Latitude’ and ‘Longitude’)

# Complete the code here
map = ggmap(get_stadiamap(bbox, zoom = 11, maptype = 'stamen_toner_lite')) +
  theme(panel.background = element_blank(),
        axis.ticks = element_blank(),
        axis.text = element_blank()) +
  labs(title = 'Arrests in NYC')
## i © Stadia Maps © Stamen Design © OpenMapTiles © OpenStreetMap contributors.
# Call 'map' by name to see the plot
map

In the Lab, we explored a few ways to plot densities on a map using geom_point(), geom_density_2d(), and geom_density_2d_filled(). Run and study the code below to see how to add this type of geom to your map. Notice that we’re giving a dataframe as well as an aes() mapping to the geom specifically. The argument alpha = controls the transparency of the geom and ranges from 0 (fully transparent) to 1 (fully opaque). We’re also using logical indexing to extract the rows of arrests that only contain offenses (OFNS_DESC) related to dangerous weapons.

offense_type = 'DANGEROUS WEAPONS'

map +
  geom_density_2d_filled(data = arrests[arrests$OFNS_DESC == offense_type,],
               aes(x = Longitude, y = Latitude),
               alpha = 0.4)

Do you see any “hotspots” where arrests are likely to be made for dangerous weapons?

— Your description of the map here —

What are the possible values that the OFNS_DESC variable can take? The unique() function is very useful when exploring a dataset. Run the unique() function below, passing in the OFNS_DESC column using the $ operator.

unique(arrests$OFNS_DESC)
##  [1] "RAPE"                                
##  [2] "ARSON"                               
##  [3] "SEX CRIMES"                          
##  [4] ""                                    
##  [5] "ASSAULT 3 & RELATED OFFENSES"        
##  [6] "FELONY ASSAULT"                      
##  [7] "DANGEROUS WEAPONS"                   
##  [8] "MISCELLANEOUS PENAL LAW"             
##  [9] "PETIT LARCENY"                       
## [10] "GRAND LARCENY OF MOTOR VEHICLE"      
## [11] "DANGEROUS DRUGS"                     
## [12] "NYS LAWS-UNCLASSIFIED FELONY"        
## [13] "GRAND LARCENY"                       
## [14] "OTHER OFFENSES RELATED TO THEF"      
## [15] "PROSTITUTION & RELATED OFFENSES"     
## [16] "OFFENSES AGAINST PUBLIC ADMINI"      
## [17] "ROBBERY"                             
## [18] "FOR OTHER AUTHORITIES"               
## [19] "CRIMINAL MISCHIEF & RELATED OF"      
## [20] "OTHER TRAFFIC INFRACTION"            
## [21] "FORGERY"                             
## [22] "INTOXICATED & IMPAIRED DRIVING"      
## [23] "MURDER & NON-NEGL. MANSLAUGHTE"      
## [24] "UNAUTHORIZED USE OF A VEHICLE"       
## [25] "BURGLARY"                            
## [26] "OFFENSES INVOLVING FRAUD"            
## [27] "VEHICLE AND TRAFFIC LAWS"            
## [28] "OFF. AGNST PUB ORD SENSBLTY &"       
## [29] "THEFT OF SERVICES"                   
## [30] "CRIMINAL TRESPASS"                   
## [31] "POSSESSION OF STOLEN PROPERTY"       
## [32] "OTHER STATE LAWS (NON PENAL LA"      
## [33] "FRAUDULENT ACCOSTING"                
## [34] "OFFENSES AGAINST THE PERSON"         
## [35] "FRAUDS"                              
## [36] "OTHER STATE LAWS"                    
## [37] "THEFT-FRAUD"                         
## [38] "INTOXICATED/IMPAIRED DRIVING"        
## [39] "GAMBLING"                            
## [40] "HOMICIDE-NEGLIGENT-VEHICLE"          
## [41] "ADMINISTRATIVE CODE"                 
## [42] "BURGLAR'S TOOLS"                     
## [43] "ALCOHOLIC BEVERAGE CONTROL LAW"      
## [44] "ENDAN WELFARE INCOMP"                
## [45] "PARKING OFFENSES"                    
## [46] "OFFENSES AGAINST PUBLIC SAFETY"      
## [47] "JOSTLING"                            
## [48] "MOVING INFRACTIONS"                  
## [49] "KIDNAPPING & RELATED OFFENSES"       
## [50] "KIDNAPPING"                          
## [51] "HOMICIDE-NEGLIGENT,UNCLASSIFIE"      
## [52] "AGRICULTURE & MRKTS LAW-UNCLASSIFIED"
## [53] "OFFENSES RELATED TO CHILDREN"        
## [54] "ANTICIPATORY OFFENSES"               
## [55] "HARRASSMENT 2"                       
## [56] "DISORDERLY CONDUCT"                  
## [57] "OTHER STATE LAWS (NON PENAL LAW)"    
## [58] "CHILD ABANDONMENT/NON SUPPORT"       
## [59] "ESCAPE 3"                            
## [60] "LOITERING/GAMBLING (CARDS, DIC"      
## [61] "KIDNAPPING AND RELATED OFFENSES"     
## [62] "ADMINISTRATIVE CODES"                
## [63] "FELONY SEX CRIMES"                   
## [64] "NEW YORK CITY HEALTH CODE"

Copy+paste the code chunk above containing geom_density_2d_filled, and change the ‘DANGEROUS WEAPONS’ condition to a different type of offense. Try it with more than one type of geom, and play with the geoms options (like alpha, linewidth), or change the color palette (e.g. ) - to get a data map that you’re happy with. Describe the resulting map - is it similar or different?

# Store the offense type value in a variable so it's easy to reuse with multiple geoms
offense_type = 'KIDNAPPING'

# Plotting code goes here
map +
  geom_density_2d_filled(data = arrests[arrests$OFNS_DESC == offense_type,],
               aes(x = Longitude, y = Latitude),
               alpha = 0.4)

— Your description of the map here —

Part 2: Exporting figures out of R

So far, all of our plots have shown up in the Plots panel, or as output in the R Markdown document. Eventually you’ll want to get figures out of RStudio and into a file that you can place in a Word document or a Powerpoint/Google slide, for example. Since we didn’t get to cover it in lab, I’ve written some code below that will export any R plot to a file on your computer. There is a function ggsave() for ggplots specifically, but the method below will work for any plot.

The steps are somewhat un-intuitive at first, but here’s how they go: (1) create an empty image file with a function specific to the file type - this turns on the “graphics device” (2) render your plot with print(my_plot) - this writes the image to the file (3) turn off the “graphics device”

# Run getwd() below, which stands for "get working directory", this isn't necessary
# but it can be handy to know exactly where your figure file is going to be saved
getwd()
## [1] "Z:/je2527/Classes/2024_SP ^TA^ Lab in Justice Data Science/Lab 5 - 2.21/assignment"
# Run the line below, which creates a blank PNG image
png('test_image.png', units = 'in', width = 5, height = 4, res = 100)

Look in the folder on your computer indicated by getwd() and you should have a blank image called test_image.png.

Now, let’s actually export a plot. First, though, we need a plot stored in the environment. Copy+paste your final map code from Part 1 above, and store the plot into a new variable called my_map.

my_map = map +
  geom_density_2d_filled(data = arrests[arrests$OFNS_DESC == offense_type,],
               aes(x = Longitude, y = Latitude),
               alpha = 0.4)

Run the code chunk below to save your plot to both .png (raster) and .pdf (vector) files. NOTE: this won’t work if you run line-by-line; so run the whole chunk at once by clicking the green arrow or clicking anywhere inside the chunk and hitting Shift+Cmd+Return (Mac) or Shift+Ctrl+Enter (Windows)

# Open a "graphics device" - here let's to save to a .png
png('my_map_figure.png', units = 'in', width = 5, height = 4, res = 100)

# This displays the ggmap you just made
print(my_map)

# This is necessary to save the plot to the file (don't ask me why)
# it seems you have to run the code chunk all at once for this to work
# (can't run line-by-line)
dev.off()
## png 
##   2
# Now save to a pdf
pdf('my_map_figure.pdf', width = 5, height = 4)
print(my_map)
dev.off()
## png 
##   2

Take a moment to view both image files on your computer. Try zooming in on both images. Does any part of the image get pixelated in either image as you zoom? What about text (like the plot title)? Are you able to click and highlight text in the .pdf file?

– Your observations on file type differences here –

Part 3: Cracking the Soho cholera outbreak case using data maps

The year is 1854. You have just arrived in Soho, London after receiving a distress signal from Dr. John Snow, leader of the “Guardians of Inter-dimensional Data”. Snow has summoned you to assist with a deadly disease outbreak that threatens to rewrite the course of history on Earth. Cholera, which causes diarrhea and life-threatening dehydration, has reached epidemic proportions in London. Snow’s hypothesis is that deaths from cholera are linked to the neighborhood’s water supply, which is provided by a number of pumps scattered throughout the area. After your briefing, Snow gives you his data core containing the number and locations of deaths, the locations of the pumps, and a map of Soho. These files seem to be in an alien format, but luckily, before traveling through the time portal to 1854 with your laptop, you installed the R packages sf and raster by running install.packages(c('sf','raster')) in the console.

Snow tells you that the data are contained within shapefiles, an ancient technology that is poorly understood. However, special functions in the sf and raster packages can read these files and even visualize them with proper equipment. He instructs you to run the lines below.

library(sf)
## Warning: package 'sf' was built under R version 4.1.3
## Linking to GEOS 3.10.2, GDAL 3.4.1, PROJ 7.2.1; sf_use_s2() is TRUE
library(ggplot2)
library(raster)
## Warning: package 'raster' was built under R version 4.1.3
## Loading required package: sp
deaths = st_read('SnowData/Cholera_deaths.shp')
## Reading layer `Cholera_deaths' from data source 
##   `Z:\je2527\Classes\2024_SP ^TA^ Lab in Justice Data Science\Lab 5 - 2.21\assignment\SnowData\Cholera_deaths.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 250 features and 2 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 529160.3 ymin: 180857.9 xmax: 529655.9 ymax: 181306.2
## Projected CRS: OSGB 1936 / British National Grid
pumps = st_read('SnowData/Pumps.shp')
## Reading layer `Pumps' from data source 
##   `Z:\je2527\Classes\2024_SP ^TA^ Lab in Justice Data Science\Lab 5 - 2.21\assignment\SnowData\Pumps.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 8 features and 1 field
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 529183.7 ymin: 180660.5 xmax: 529748.9 ymax: 181193.7
## Projected CRS: OSGB 1936 / British National Grid
map = raster('SnowData/SnowMap.tif')

map = as.data.frame(map, xy = TRUE)

Snow then connects your laptop to a holographic visualizer, and while doing so, describes a special geom called geom_sf() that works nicely with data contained in shapefiles. To your surprise, he tells you that linking the data to X and Y coordinates using aes() is not necessary, as geom_sf() interprets the geometry columns of shapefiles as coordinates. He encourages you to look at the structure of deaths and pumps by clicking on them in your Environment, and then to run the code chunk below.

ggplot() +
  geom_sf(data = deaths, aes(size = Count)) +
  geom_sf(data = pumps, shape = 21, size = 4, color = 'black', fill = '#35cc97')

Inquisitively, you turn to Snow, then point to the strange code fill = '#35cc97' above. He calls this representation of color a hex code and tells you how to obtain them. You connect to Inter-dimensional Wi-fi and Google “HTML Color Picker”. An interactive panel appears, with a “HEX” box below containing a # followed by a 6-letter/digit code, which you can copy/paste into RStudio. You then point to the part that says shape = 21. Snow describes how “point” geoms, or “markers”, have many different available shapes, which are identified by a number. You Google “R Marker Shapes” and find that the numbers 0-25 represent different shapes such as circles, squares, and triangles. Some of them have color and fill properties, some of them only color. Play around with HEX colors, marker shapes, and sizes to customize the plot above. NOTE: none of these arguments should be inside aes()!

Snow, getting frustrated with the amount of time you’re spending beautifying this plot, reminds you of the reason you are here - to find the source of the disease outbreak!! Looking at your plot (which is beautiful), you and Snow are both convinced that the pump at the center of the deaths cluster must be related to the outbreak. You tell Snow that surely this figure is evidence enough for the Inter-dimensional Council to take action. He refuses, exclaiming that the Council will not be persuaded unless you quantitatively show that this is the correct pump. He rubs his forehead and mutters that there must be some way to quantify which pump is central to the outbreak.

Eureka! You have an idea: because these data are spatial - you could find the pump that minimizes the total (summed) distance to each of the recorded deaths! Snow’s eyes widen. He turns a hopeful glance towards you.

You show Snow how you can easily find the distance between a single death and a single pump with the code below:

st_distance(deaths[1,'geometry'], pumps[1,'geometry'])
## Units: [m]
##          [,1]
## [1,] 88.02289

You then show Snow how to find death-pump distances for multiple locations by changing deaths[1,'geometry'] to deaths[1:5,'geometry']. Try this, then try leaving the “row position” blank: deaths[,'geometry']. Tell me what the output shows you.

— Your description here —

Snow jumps up and down in excitement. You describe to him an algorithm for finding the pump that is nearest to all deaths: 1. initiate a list of distances starting at 0 for each pump 2. for each pump i: + 1. compute the distance from pump i to each row of deaths + 2. store the sum of those distances in the ith element of distances 3. the pump corresponding to the minimum of distances is the central pump

(Jacob here: see if you can implement this algorithm. There are many other ways to go about this - feel free to experiment. distances should be a numeric vector of length 8. It should be all 0’s at first, then all large numbers at the end. The functions nrow(), numeric(), sum(), and st_distance() may be useful. The for loop isn’t necessary but may be useful. If you get stuck, check the A5hints.txt file on Courseworks)

# Compute the summed distances from each pump to all death locations
n_pumps = nrow(pumps)
distances = numeric(n_pumps)

for (i in 1:n_pumps) {
  
  pump_dist = st_distance(deaths[,'geometry'], pumps[i,'geometry'])
  
  distances[i] = sum(pump_dist)
  
}

# Show the list of summed distances
distances
## [1]  31509.16  61129.79  75082.31  91563.54  66150.23  59459.83 107222.49
## [8]  71052.50

What is the information contained in the variable distances?

— Your description here —

To find the pump that minimizes the distances to all deaths, you want the index or position of the minimum value in distances (i.e. an integer from 1-8). Try to get this on your own by googling, or see the A5hints.txt file.

# Find the index of the minimum element of distances
closest_pump_idx = which(distances == min(distances))

What is the value of closest_pump_idx? What information does this give you?

— Your description here —

Having quantitatively found which pump is the probable cause for the outbreak, you and Snow prepare a visualization for the Inter-dimensional Council. The chunk below will plot deaths and pumps on top of Snow’s original map, with an additional marker for the pump you’ve identified as the culprit. Customize it however you like!

ggplot() +
  geom_raster(data = map, aes(x = x, y = y, fill = SnowMap_1)) +    # first 2 lines draw Snow's original map
  scale_fill_gradient(low = 'black', high = 'white') +              # makes sure map color is black & white
  geom_sf(data = deaths, aes(size = Count * 1.1), shape = 21, color = 'black', fill = '#77a35b') +
  geom_sf(data = pumps, color = 'black', fill = '#7af5f1', size = 4, shape = 25) +
  geom_sf(data = pumps[closest_pump_idx,], 
               color = 'black', fill = '#ff54ee', size = 5, shape = 25)

Congratulations! By locating the pump, you’ve ended an epidemic, saved countless lives, and sky-rocketed John Snow’s career. You step into the time portal to return home, knit this file to HTML, and submit to Courseworks.

Part 4: Your feedback

On a scale of 1-5, with 1 being “too easy”, 5 being “too hard”, and 3 being “a good level”, how difficult was this assignment?

Approximately how much time did you spend working on this assignment? (i.e. actively solving the problem, not exploring the data independently)

Any thoughts on what you found useful, or what you found mundane, or confusing? (anything, big or small)